Introduction

As with normal ciphers, there is a trivial brute-force attack which can find a collision in any hash function . If the hashes produced by the are all of length , then to find a collision we can just evaluate on different inputs. Since the number of possible hashes is only , then at least two inputs must have produced the same hash and our job is done.

Usually, we are not particularly worried about this attack because it takes steps to execute. However, it turns out that there is a much more efficient attack which can find a collision against any hash function.

The Birthday Paradox

To illustrate the attack we are going to answer the following question: given people in a room, what is the probability that two of them share a birthday? One should see how this is equivalent to asking what is the likelihood that from messages two produce a collision in the hash function .

We assume that each birthday date is equally likely and that we are only working with the possible birthdays in a non-leap year. The probability that two people share the same birthday is the same as the negation of the probability that no people share a birthday, i.e.
the probability of a collision is the negation of the probability that there is no collision amongst the messages .

Imagine the people entering the room one by one (or equivalently, the messages being generated independently one after the other). The probability that there is no collision in the birthdays of the people is the probability that there is no collision in the birthdays of the first people and that the -th person's birthday also does not collide with the previous birthdays, i.e.

This is true because if there were no collisions in the first people, then there must be unique birthdays and so the probability that the -th person's birthday is also unique is . This logic can be continued until we reach the first person. Therefore,

The 1 at the beginning represents the probability that the first person's birthday does not collide with someone's else when entering the room, which is 100%, since there are no other people in the room until the first one enters. This probability can be rewritten as the following product:

Therefore, the probability that a collision does occur can be written as

We are now going to use a well-known inequality (we are going to take it for granted because proving it is out of scope), namely that . Plugging in for , we get that

What is nice about exponential functions with the same base is that when multiplying them, the exponents simply add, yielding

The function is always greater than for positive integers and so we have

Recall that the left-hand side is smaller than the probability of a collision. Therefore,

While we did not obtain an exact equation for the value of , we did obtain a lower bound for it!

Given $q$ elements which are uniformly and independently chosen from a set of $B$ possible elements, the probability that two elements are the same is at least $1 - e^{-\frac{q^2}{B}}$.

Now let's put the theorem to work. How many people do we need in the room in order for there to be 50% chance that two of them share a birthday? Well, plug in and set

Solving this equation yields . We need only 23 people for there to be a 50% chance of two of them sharing a birthday!

Naive Birthday Attack

If we have a hash function with outputs of length , then in order to have a 50% chance of a collision, we need different messages (this can be obtained from the Birthday theorem bound by setting ).

The naive birthday attack does precisely this. First, it chooses different messages . It then computes their hashes . Finally, it looks for a collision amongst these hashes . With probability approximately it is going to find such a collision. If it does not, it simply starts over. On average, this attack is going to need just 2 iterations to get a colliding pair and its running time is . Compare that to the brute-force approach whose running time was .

This variation is called naive because it has a huge space complexity, namely , since the algorithm will have to store all the computed hashes while checking them for collisions.

Since the birthday attack is universal and works for any hash function, it is used instead of the simple brute force attack as the gold standard when creating security proofs.

Small-Space Birthday Attack

There is an improved version of the birthday attack which has approximately the same probability success and running time but only takes a constant amount of memory. This attack uses Floyd's cycle finding algorithm.

Begin by choosing a random initial message and set . At the -th iteration compare the values and . If , then we know that there must have been a collision somewhere along the way - it might simply happen that , in which case we would have immediately found the collision pair . However, it could very well be the case that and so the actual collision, i.e. the two different inputs that produced the same hash, happened earlier. Since we did not store all of the hashes we burnt through, we will need to iterate over them again to find precisely which ones collide.

Store the index for which we found that and reset to the initial value . This time we will iterate until . At each step , we check if and if it is, we have our collision - simply return and . Otherwise, we set and .

fn SmallSpaceBirthdayAttack()
{
	let x_0 = random_binary_string();
	let x = x_0;
	let x' = x_0;
	let i = 0;
	
	while(true)
	{
		x = H(x);
		x' = H(H(x'));
		
		if (x = x')
		{
			break;
		}
		else
		{
			++i;
		}
	}
	
	let x = x_0;
	let x' = x_0;
	
	for(let j = 0; j < i; ++j)
	{
		if (H(x) = H(x'))
		{
			return (x, x');
		}
		else
		{
			x = H(x);
			x' = H(x');
		}
	}
}

This attack uses much less memory than the naive method because it only needs to store the initial value as well as the two values and which are being checked at each iteration. As before, we have a chance of finding a collision within the first hashes we check.